INTERSPEECH 2006 - Speech Recognition

Total: 106

#1 Feature normalization using smoothed mixture transformations

Authors: Patrick Kenny ; Vishwa Gupta ; G. Boulianne ; Pierre Ouellet ; Pierre Dumouchel

We propose a method for estimating the parameters of SPLICE-like transformations from individual utterances so that this type of transformation can be used to normalize acoustic feature vectors for speech recognition on an utterance-by-utterance basis in a similar manner to cepstral mean normalization. We report results on an in-house French language multi-speaker database collected while deploying an automatic closed-captioning system for live broadcast news. An unusual feature of this database is that there are very large amounts of training data for the individual speakers (typically several hours) so that it is very difficult to improve on multi-speaker modeling by using standard methods of speaker adaptation. We found that the proposed method of feature normalization is capable of achieving a 6% relative improvement over cepstral mean normalization on this task.
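
As a point of reference for the baseline being improved on, here is a minimal sketch of per-utterance cepstral mean normalization (CMN); the function name and array layout are illustrative, not from the paper.

```python
import numpy as np

def cepstral_mean_normalization(features):
    """Per-utterance CMN: subtract the mean of each cepstral
    coefficient, computed over all frames of the utterance.
    features: (num_frames, num_coeffs) array of e.g. MFCCs."""
    return features - features.mean(axis=0, keepdims=True)
```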

#2 Stochastic vector mapping-based feature enhancement using prior model and environment adaptation for noisy speech recognition

Authors: Chia-Hsin Hsieh ; Chung-Hsien Wu ; Jun-Yu Lin

This paper presents an approach to feature enhancement for noisy speech recognition. Three prior models are introduced to characterize clean speech, noise, and noisy speech, respectively, using sequential noise estimation based on noise-normalized stochastic vector mapping. Environment adaptation is also adopted to reduce the mismatch between training data and test data. On the AURORA2 database, the experimental results indicate that a 0.77% digit accuracy improvement for multi-condition training and a 0.29% digit accuracy improvement for clean-speech training were achieved without stereo training data, compared to the SPLICE-based approach with recursive noise estimation. On the MAT-BN Mandarin broadcast news database, a 2.6% syllable accuracy improvement for anchor speech and a 4.2% syllable accuracy improvement for field report speech were obtained compared to the MCE-based approach.
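
The noise-normalized stochastic vector mapping belongs to the SPLICE family; as background, the sketch below shows the generic SPLICE-style correction x_hat = y + sum_k p(k|y) r_k on which such mappings are built. The GMM parameters and bias vectors are assumed inputs; the paper's sequential noise estimation and environment adaptation are not shown.

```python
import numpy as np
from scipy.stats import multivariate_normal

def splice_enhance(y, weights, means, covs, biases):
    """Generic SPLICE-style mapping: x_hat = y + sum_k p(k|y) * r_k,
    where p(k|y) is the posterior of component k under a GMM trained
    on noisy speech and r_k is a learned correction vector."""
    # Posterior of each Gaussian component given the noisy frame y.
    lik = np.array([w * multivariate_normal.pdf(y, m, c)
                    for w, m, c in zip(weights, means, covs)])
    post = lik / lik.sum()
    # Apply the posterior-weighted correction.
    return y + post @ np.asarray(biases)
```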

#3 A framework for robust MFCC feature extraction using SNR-dependent compression of enhanced mel filter bank energies

Authors: Babak Nasersharif ; Ahmad Akbari

The Mel-frequency cepstral coefficients (MFCC) are the most widely used and most successful features for speech recognition, but their performance degrades in the presence of additive noise. In this paper, we propose a noise compensation method for Mel filter bank energies, and hence for MFCC features. The method comprises two steps: Mel sub-band spectral subtraction followed by compression of the Mel sub-band energies. For the compression step, we propose a sub-band SNR-dependent compression function, which we use in place of the logarithm in conventional MFCC feature extraction in the presence of additive noise. Experimental results show that the proposed method significantly improves MFCC feature performance in noisy conditions, decreasing the word error rate by about 70% at an SNR of 0 dB for different types of additive noise.
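
The abstract does not give the exact compression function, so the sketch below only illustrates the general idea of replacing the logarithm with an SNR-dependent root compression; the sigmoid mapping and exponent range are hypothetical.

```python
import numpy as np

def compress_subband(energies, snr_db, gamma_min=0.05, gamma_max=0.33):
    """Hypothetical SNR-dependent root compression of mel sub-band
    energies, used in place of log(.) in MFCC extraction. Low-SNR
    bands get a smaller exponent, i.e. stronger compression."""
    # Map sub-band SNR (dB) to an exponent in (gamma_min, gamma_max).
    gamma = gamma_min + (gamma_max - gamma_min) / (1.0 + np.exp(-snr_db / 5.0))
    # Floor the energies to keep the power well-defined.
    return np.power(np.maximum(energies, 1e-10), gamma)
```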

#4 Coupling particle filters with automatic speech recognition for speech feature enhancement

Authors: Friedrich Faubel ; Matthias Wölfel

This paper addresses robust speech feature extraction in combination with statistical speech feature enhancement, coupling the particle filter to the speech recognition hypotheses. To extract noise-robust features, the Fourier transformation is replaced by the warped and scaled minimum variance distortionless response spectral envelope. To enhance the features, particle filtering is used. Further, we show that robust extraction and statistical enhancement can be combined to good effect. One of the critical aspects of particle filter design is the particle weight calculation, which is traditionally based on a general, time-independent speech model approximated by a Gaussian mixture distribution. We replace this general, time-independent speech model with time- and phoneme-specific models. The knowledge of the phonemes to be used is obtained from the hypotheses of a speech recognition system, thereby establishing a coupling between the particle filter and the speech recognition system, which have been treated as independent components in the past.

#5 Extension and further analysis of higher order cepstral moment normalization (HOCMN) for robust features in speech recognition

Authors: Chang-wen Hsu ; Lin-shan Lee

Cepstral normalization has been widely used as a powerful approach to produce robust features for speech recognition. Good examples include the well-known Cepstral Mean Subtraction (CMS) and Cepstral Mean and Variance Normalization (CMVN), in which either the first, or both the first and second, moments of the Mel-frequency Cepstral Coefficients (MFCCs) are normalized [1, 2]. Such approaches were previously extended to Higher Order Cepstral Moment Normalization (HOCMN), which normalizes moments of orders much higher than two [3]. Here we further extend HOCMN to a more generalized form, using generalized moments with non-integer orders defined in this paper. Extensive experimental results based on a newly defined development set for AURORA 2.0 indicate not only that HOCMN with integer moment orders can perform significantly better than the well-known approach of Histogram Equalization (HEQ), but also that further improvements can be obtained consistently for almost all SNR values with non-integer moment orders. We also discuss the theoretical foundation behind the proposed approaches, which explains why HOCMN performs well and how the statistical properties of the MFCC parameter distributions are adjusted during normalization.
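
A minimal sketch of the core normalization step, assuming the usual formulation in which the mean is removed and each coefficient is rescaled so that a chosen (possibly non-integer) order absolute central moment becomes constant; the paper's generalized moments may differ in detail.

```python
import numpy as np

def hocmn(features, order=4.0):
    """Sketch of higher-order cepstral moment normalization: remove
    the mean, then scale each coefficient so its order-th absolute
    central moment equals one. `order` may be non-integer, as in
    the generalized form studied in the paper."""
    centered = features - features.mean(axis=0, keepdims=True)
    moment = np.mean(np.abs(centered) ** order, axis=0)
    return centered / (moment ** (1.0 / order) + 1e-10)
```

With order=1.0 this reduces to a mean/absolute-deviation normalization, and order=2.0 recovers CMVN up to a constant, which is why HOCMN can be seen as a strict generalization of those schemes.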

#6 An improved mel-wiener filter for mel-LPC based speech recognition

Authors: Md. Babul Islam ; Hiroshi Matsumoto ; Kazumasa Yamamoto

We previously proposed a Mel-Wiener filter to enhance Mel-LPC spectra in the presence of additive noise. The filter was estimated by minimizing the sum of squared errors on the linear frequency scale, and was efficiently implemented in the autocorrelation domain without denoising the input speech. In the previous system, we segregated speech and noise using an energy-based VAD, and a very simple flooring technique was used for the noise segments. In the present work, we improve the VAD using an autoregressive (AR) model of the noise, and improve the flooring technique as well. In addition, a lag window is applied to the estimated noise autocorrelation function to smooth the fine spectral structure contributed by high-order autocorrelation coefficients. As a result, substantial improvement is obtained over the previous results.
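
Lag windowing itself is a standard spectral-smoothing operation; the sketch below applies a Gaussian lag window to an estimated noise autocorrelation sequence. The window shape and bandwidth are assumptions, as the abstract does not specify them.

```python
import numpy as np

def lag_window_autocorr(r, bandwidth=0.01):
    """Apply a Gaussian lag window to an autocorrelation sequence
    r[0..p]. Tapering the high-order lags smooths the corresponding
    power spectrum, suppressing its fine structure."""
    k = np.arange(len(r))
    window = np.exp(-0.5 * (2.0 * np.pi * bandwidth * k) ** 2)
    return r * window
```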

#7 Computer-assisted closed-captioning of live TV broadcasts in French

Authors: G. Boulianne ; J.-F. Beaumont ; M. Boisvert ; J. Brousseau ; P. Cardinal ; C. Chapdelaine ; M. Comeau ; Pierre Ouellet ; F. Osterrath

Growing needs for French closed-captioning of live TV broadcasts in Canada cannot be met with stenography-based technology alone, because of a chronic shortage of skilled stenographers. Using speech recognition for live closed-captioning, however, requires solving several specific problems, such as low-latency real-time recognition, remote operation, automated model updates, and collaborative work. In this paper we describe our solutions to these problems and the implementation of a live captioning system based on the CRIM speech recognizer. We report results from field deployment in several projects; the oldest in operation has been broadcasting real-time closed-captions for more than two years.

#8 On the use of morphological analysis for dialectal Arabic speech recognition

Authors: Mohamed Afify ; Ruhi Sarikaya ; Hong-Kwang Jeff Kuo ; Laurent Besacier ; Yuqing Gao

Arabic has a large number of affixes that can modify a stem to form words. In automatic speech recognition (ASR) this leads to a high out-of-vocabulary (OOV) rate for typical lexicon sizes, and hence a potential increase in WER. This is even more pronounced for dialects of Arabic, where additional affixes are often introduced and the available data is typically sparse. To address this problem we introduce a simple word decomposition algorithm which requires only a text corpus and a predefined list of affixes. Using this algorithm to create the lexicon for Iraqi Arabic ASR results in about a 10% relative improvement in word error rate (WER). Taking the union of the segmented and unsegmented vocabularies and interpolating the corresponding language models yields a further WER reduction. The net WER improvement is about 13% relative.
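
A toy illustration of affix-based decomposition in the spirit described (requiring only a text corpus and an affix list); the frequency and stem-length thresholds are hypothetical, not the paper's.

```python
def decompose(word, prefixes, suffixes, counts, min_count=5, min_stem_len=3):
    """Split off a known prefix or suffix when the residual stem is
    itself a sufficiently frequent corpus word; otherwise keep the
    word whole. counts: dict mapping words to corpus frequencies."""
    for p in sorted(prefixes, key=len, reverse=True):
        stem = word[len(p):]
        if word.startswith(p) and len(stem) >= min_stem_len \
                and counts.get(stem, 0) >= min_count:
            return [p + '+', stem]      # '+' marks the affix segment
    for s in sorted(suffixes, key=len, reverse=True):
        stem = word[:-len(s)]
        if word.endswith(s) and len(stem) >= min_stem_len \
                and counts.get(stem, 0) >= min_count:
            return [stem, '+' + s]
    return [word]
```

The segmented units then form the recognition lexicon, and the recognizer output can be re-glued at the '+' markers before scoring WER.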

#9 Recognition of classroom lectures in European Portuguese

Authors: Isabel Trancoso ; Ricardo Nunes ; Luís Neves ; Céu Viana ; Helena Moniz ; Diamantino Caseiro ; Ana Isabel Mata

Classroom lectures can be very challenging for automatic speech recognizers, because the vocabulary may be very specific and the speaking style very spontaneous. Our first experiments using a recognizer trained for Broadcast News resulted in word error rates near 60%, clearly confirming the need for adaptation to the specific topic of the lectures, on the one hand, and for better strategies for handling spontaneous speech, on the other. This paper describes our efforts in these two directions: the domain adaptation steps that lowered the error rate to 45% with very little transcribed adaptation material, and an exploratory study of spontaneous speech phenomena in European Portuguese, namely filled pauses.

#10 Investigating automatic decomposition for ASR in less represented languages

Authors: Thomas Pellegrini ; Lori Lamel

This paper addresses the use of an automatic decomposition method to reduce lexical variety and thereby improve speech recognition of less well-represented languages. The Amharic language was selected for these experiments, since only a small quantity of resources is available compared to well-covered languages. Inspired by the Harris algorithm, the method automatically generates plausible affixes that, combined with decompounding, can reduce the size of the lexicon and the OOV rate. Recognition experiments are carried out for four different configurations (full-word and decompounded), using supervised training with a corpus containing only two hours of manually transcribed data.
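
The Harris algorithm proposes morpheme boundaries where the number of distinct possible continuations peaks; a minimal sketch of that successor-variety computation follows. The peak threshold is a placeholder, since the abstract does not give the paper's criterion.

```python
from collections import defaultdict

def successor_variety(vocab):
    """For each prefix occurring in the vocabulary, count the distinct
    characters that can follow it. High variety suggests a plausible
    morpheme boundary (Harris-style segmentation)."""
    succ = defaultdict(set)
    for word in vocab:
        for i in range(1, len(word)):
            succ[word[:i]].add(word[i])
    return {prefix: len(chars) for prefix, chars in succ.items()}

def split_candidates(word, sv, threshold=5):
    """Propose split points where successor variety is high
    (hypothetical threshold)."""
    return [i for i in range(2, len(word) - 1)
            if sv.get(word[:i], 0) >= threshold]
```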

#11 Automatic transcription of Somali language

Authors: Abdillahi Nimaan ; Pascal Nocéra ; Jean-François Bonastre

Most African countries follow an oral tradition to transmit their cultural, scientific, and historic heritage through the generations. This ancestral knowledge, accumulated over centuries, is today threatened with disappearing. Automatic transcription and indexing tools appear to be a potential solution for preserving it. This paper presents the first steps toward automatic speech recognition (ASR) of Djiboutian languages, with the goal of indexing the Djiboutian cultural heritage. This work addresses the Somali language, which represents half of the targeted Djiboutian audio archives. We describe the principal characteristics of the collected audio (10 hours) and textual (3M words) training corpora, and the first ASR results for this language. Exploiting a specificity of Somali (words are composed of a concatenation of sub-words, called "roots" in this paper), we improve on these results. We also discuss future directions of research, such as root-based indexing of audio archives.

#12 Analysis of overlaps in meetings by dialog factors, hot spots, speakers, and collection site: insights for automatic speech recognition

Authors: Özgür Çetin ; Elizabeth Shriberg

In previous work we found that automatic speech recognition (ASR) results on meetings show interesting patterns with respect to speaker overlaps, including a robust asymmetry in word error rates (WERs) before and after overlaps. The paradigm used allowed us to infer that these correlations are not due to crosstalk itself but to changes in how a person speaks around overlap regions. To better understand these ASR and perplexity results, we analyze speaker overlaps with respect to various factors, including collection site, speakers, dialog acts, and hot spots.

#13 Improving speech recognition of two simultaneous speech signals by integrating ICA BSS and automatic missing feature mask generation

Authors: Ryu Takeda ; Shun'ichi Yamamoto ; Kazunori Komatani ; Tetsuya Ogata ; Hiroshi G. Okuno

Robot audition systems require capabilities for sound source separation and for recognition of the separated sounds, since we hear a mixture of sounds in our daily lives, especially mixtures of speech. We report a robot audition system, with a pair of omni-directional microphones embedded in a humanoid, that recognizes two simultaneous talkers. It first separates the sound sources by Independent Component Analysis (ICA) with the single-input multiple-output (SIMO) model. Spectral distortion in the separated sounds is then estimated to generate missing feature masks. Finally, the separated sounds are recognized with missing-feature theory (MFT) based automatic speech recognition (ASR). The novel aspect of our system is that spectral distortion is estimated in the time-frequency domain, in terms of feature vectors, based on estimation errors in the SIMO-ICA signals. The resulting system outperformed the baseline robot audition system by 7%.
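
A minimal sketch of the mask-generation step, assuming a simple threshold on the estimated per-component spectral distortion; how the distortion is derived from SIMO-ICA estimation errors is the paper's contribution and is not reproduced here.

```python
import numpy as np

def missing_feature_mask(estimated_distortion, threshold=3.0):
    """Binary missing-feature mask over time-frequency feature
    components: components with small estimated distortion are
    marked reliable (1), the rest missing (0). The threshold is
    hypothetical and would be tuned on development data."""
    return (np.abs(estimated_distortion) < threshold).astype(float)
```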

#14 Missing-feature reconstruction for band-limited speech recognition in spoken document retrieval

Authors: Wooil Kim ; John H. L. Hansen

In spoken document retrieval, it is necessary to support a variety of audio corpora from sources with a range of conditions (e.g., channels, microphones, noise conditions, recording media, etc.). Varying band-limited speech represents one of the most challenging factors for robust speech recognition. The missing-feature reconstruction method has proven effective for recognizing speech corrupted by additive noise. However, it has a problem when applied to band-limited speech reconstruction, since it assumes that the observations in the unreliable regions are always greater than the latent original clean speech. In this study, we propose to modify the marginal probability computation used for reconstruction so that it depends only on the reliable components. To detect the cut-off regions in incoming speech, a blind mask estimation scheme is proposed, which employs a synthesized band-limited speech model and requires no training database. Experimental results on Aurora 2.0 and on actual band-limited speech (the NGSW corpus) indicate that the proposed method is effective in improving recognition accuracy for band-limited speech. Combined with an adaptation method, a 22.17% relative improvement is obtained on NGSW.
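
For diagonal-covariance GMMs, making the marginal probability depend only on the reliable components amounts to dropping the unreliable dimensions from the likelihood, as in this sketch (an illustration of the general marginalization idea, not the paper's exact estimator).

```python
import numpy as np

def reliable_marginal_loglik(x, mask, weights, means, variances):
    """Marginal log-likelihood of a spectral vector under a
    diagonal-covariance GMM, computed from the reliable components
    only; for diagonal Gaussians, marginalizing out the unreliable
    dimensions means simply omitting them."""
    r = mask.astype(bool)                 # True = reliable component
    xr = x[r]
    comp = []
    for w, m, v in zip(weights, means, variances):
        mr, vr = m[r], v[r]
        ll = -0.5 * np.sum(np.log(2 * np.pi * vr) + (xr - mr) ** 2 / vr)
        comp.append(np.log(w) + ll)
    return np.logaddexp.reduce(comp)      # log-sum-exp over components
```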

#15 Incremental learning of MAP context-dependent edit operations for spoken phone number recognition in an embedded platform

Authors: Hahn Koo ; Yan Ming Cheng

Error-corrective post-processing (ECPP) has great potential to reduce speech recognition errors beyond what is obtained by speech model improvement. ECPP approaches aim to learn error-corrective rules that directly reduce speech recognition errors. This paper presents our investigation into one such approach: incremental learning of maximum a posteriori (MAP) context-dependent edit operations. Limiting our dataset to spoken telephone number recognition output, we evaluated this approach in an automotive environment using an embedded speech recognizer in a mobile device. We found that a reduction of approximately 44-49% in speech recognition string errors can be achieved after learning.

#16 Development and evaluation of speech database in automotive environments for practical speech recognition systems

Authors: Yasunari Obuchi ; Nobuo Hataoka

Aiming at practical speech recognition systems, we are developing speech databases that represent the situations in which an application is used, and evaluating various techniques on them. This methodology is expected to help bridge the gap between the expectations of developers and the reactions of users. We start with applications in automotive environments, more precisely car navigation systems. During data collection, special attention was paid to maintaining the spontaneity of the speaker, covering failed utterances, and using a hardware setup suitable for microphone array techniques. With the database prepared, various techniques were evaluated. In some cases, oracle information is used to find the upper limit of the improvement attainable by a specific module; in other cases, typical improvement algorithms are tested. Recognition experiments using two separate decoders indicate that endpoint detection, feature normalization, speaker adaptation, and parallel decoding are promising fields. We also present some modifications of parallel decoding that reduce the computational cost and make practical applications feasible.

#17 An effective and efficient utterance verification technology using word n-gram filler models

Authors: Dong Yu ; Yun-Cheng Ju ; Alex Acero

In this paper we propose a novel, effective, and efficient utterance verification (UV) technology for access control in interactive voice response (IVR) systems. The key to our approach is to construct a context-free grammar from the secret answer to a question and a word N-gram based filler model. The N-gram filler provides rich alternatives to the secret answer and can potentially improve the accuracy of the UV task. It can also absorb carrier words used by callers and thus improve robustness. We also propose a predictor based on the best alternative to calculate the confidence. We show detailed experimental results on a tough UV test set containing 930 positive and 930 negative cases, and discuss the types of questions that are suitable for the UV task. We demonstrate that our approach achieves a 2.14% equal error rate (EER) on average, and a 0.8% false accept rate when the false reject rate is 2.6% or above. This is a 49% EER reduction compared with approaches using acoustic fillers, and a 72% EER reduction compared with a posterior probability based confidence measure.
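
A minimal sketch of a best-alternative confidence predictor of the kind described: the recognizer's score for the secret-answer path is compared against the best word-N-gram filler path. Variable names and the tuned threshold are illustrative.

```python
def verify(answer_score, best_filler_score, threshold):
    """Accept the caller when the secret answer's score beats the
    best filler alternative by a sufficient margin. Scores are
    assumed to be log-likelihoods from the recognizer, so the
    confidence is a log-likelihood ratio; the threshold would be
    tuned on development data to the desired operating point."""
    confidence = answer_score - best_filler_score
    return confidence >= threshold, confidence
```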

#18 An efficient bispectrum phase entropy-based algorithm for VAD

Authors: J. M. Górriz ; Javier Ramírez ; C. G. Puntonet ; José C. Segura

In this paper we propose a novel Voice Activity Detection (VAD) algorithm, based on the integrated bispectrum (IBI) function, for improving automatic speech recognition (ASR) systems that work in noisy environments. In particular, we combine two features, IBI magnitude and IBI phase, to formulate a robust and smoothed decision rule for speech/pause discrimination. The analysis performed on the new combined feature highlighted: i) the advantages of each individual feature, each compensating for the drawbacks of the other, and ii) a higher ability for endpoint detection, owing to the lower variance of the decision function in pause/speech frames. Experiments conducted on the Spanish SpeechDat-Car database showed that the proposed algorithm outperforms the ITU G.729, ETSI AMR1 and AMR2, and ETSI AFE standards, as well as other recently reported VAD methods, in speech/non-speech detection performance.
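
A sketch of what a smoothed two-feature decision rule can look like: the two per-frame features are fused and recursively averaged before thresholding. The fusion weights, smoothing constant, and threshold are hypothetical; the paper's actual rule is derived from the integrated bispectrum statistics.

```python
import numpy as np

def vad_decision(ibi_mag, ibi_phase_entropy, alpha=0.9, threshold=0.5):
    """Fuse two per-frame feature tracks, recursively smooth the
    fused score to lower its variance, and threshold it to obtain
    frame-level speech/pause decisions."""
    fused = 0.5 * ibi_mag + 0.5 * ibi_phase_entropy
    decisions, score = [], 0.0
    for f in fused:
        score = alpha * score + (1.0 - alpha) * f   # recursive smoothing
        decisions.append(score > threshold)
    return np.array(decisions)
```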

#19 Two-step unsupervised speaker adaptation based on speaker and gender recognition and HMM combination

Authors: Petr Cerva ; Jan Nouza ; Jan Silovsky

In this paper, we present a new strategy for unsupervised speaker adaptation. In our approach, adaptation is performed in two steps for each test utterance. In the first, online step, we use speaker and gender identification, a set of speaker-dependent (SD) hidden Markov models (HMMs), and our own fast linear model combination approach to create a proper model for the first speech recognition pass. After that, the recognized phonetic transcription of the utterance is used for maximum likelihood (ML) estimation of more accurate weights for the final model combination step. Our experimental results on different types of broadcast programs show that the proposed method is capable of reducing the word error rate (WER) by more than 17% relative.
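
A minimal sketch of the linear-combination idea, assuming the combined model's Gaussian means are weighted sums of the corresponding SD-model means; the paper's fast combination and ML weight re-estimation are more involved.

```python
import numpy as np

def combine_means(sd_means, weights):
    """Combine speaker-dependent HMMs linearly: each combined mean is
    a weighted sum of the corresponding SD means. sd_means has shape
    (num_speakers, num_states, num_mix, dim); weights is normalized
    to sum to one before the contraction over the speaker axis."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return np.tensordot(weights, sd_means, axes=1)
```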

#20 CENSREC-2: corpus and evaluation environments for in-car continuous digit speech recognition

Authors: Satoshi Nakamura ; Masakiyo Fujimoto ; Kazuya Takeda

This paper introduces CENSREC-2, a common database and evaluation framework for connected digit speech recognition in real driving-car environments, developed as an outcome of the IPSJ-SIG SLP Noisy Speech Recognition Evaluation Working Group. The speech data of CENSREC-2 were collected using two microphones, a close-talking microphone and a hands-free microphone, under three car speeds and four in-car conditions. CENSREC-2 provides four evaluation environments designed using the speech data collected under these conditions.

#21 Detection of word fragments in Mandarin telephone conversation

Authors: Cheng-Tao Chu ; Yun-Hsuan Sung ; Yuan Zhao ; Daniel Jurafsky

We describe preliminary work on the detection of word fragments in Mandarin conversational telephone speech. We extracted prosodic, voice quality, and lexical features, and trained Decision Tree and SVM classifiers. Previous research shows that glottalization features are instrumental in English fragment detection. However, we show that Mandarin fragments are quite different from English ones: 90% of Mandarin fragments are followed immediately by a repetition of the fragmentary word. These repetition fragments are not glottalized, and they have a very specific distribution; the 12 most frequent words ("you", "I", "that", "have", "then", etc.) cover 50% of the tokens of these fragments. Thus, rather than glottalization, we found the most useful feature for Mandarin fragment detection to be the identity of the neighboring character (word or morpheme). In an oracle experiment using the true (reference) neighboring words as well as prosodic and voice quality features, we achieved 80% accuracy in Mandarin fragment detection.
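
A sketch of the lexical feature extraction suggested by these findings, assuming tokenized transcripts in which a fragment is a prefix of its repeated full word; the feature names are illustrative.

```python
def fragment_features(tokens, i):
    """Lexical features for classifying whether tokens[i] is a word
    fragment, exploiting the observation that Mandarin fragments are
    usually repeated by the next word: the identities of the
    neighboring tokens plus a repetition indicator."""
    prev_tok = tokens[i - 1] if i > 0 else '<s>'
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else '</s>'
    return {
        'prev': prev_tok,
        'next': next_tok,
        # A fragment is typically a prefix of the repeated full word.
        'next_repeats': next_tok.startswith(tokens[i]),
    }
```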

#22 A DTW-based dissimilarity measure for left-to-right hidden Markov models and its application to word confusability analysis

Authors: Qiang Huo ; Wei Li

We propose a dynamic time-warping (DTW) based distortion measure for the dissimilarity between pairs of left-to-right continuous-density hidden Markov models whose state observation densities are mixtures of Gaussians. The local distortion score required in DTW is defined as an approximate Kullback-Leibler divergence (KLD) between two Gaussian mixture models (GMMs). Several approximate KLDs are studied and compared for pairs of GMMs with different properties, and one of them is selected for use in our DTW-based HMM dissimilarity measure. In an experiment on automatically identifying subsets of confusable Putonghua (Mandarin Chinese) syllables, the result based on the proposed HMM dissimilarity measure is highly consistent with one based on a syllable recognition confusion matrix obtained on a test data set.
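
A sketch of the overall construction, using the closed-form KLD between diagonal Gaussians (the single-component special case of the approximate GMM KLDs studied in the paper), symmetrized and plugged into a standard DTW recursion over the two state sequences.

```python
import numpy as np

def kld_diag_gauss(m1, v1, m2, v2):
    """Closed-form KL(N1 || N2) for diagonal Gaussians with means
    m1, m2 and variances v1, v2."""
    return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def dtw_hmm_distance(states_a, states_b):
    """DTW alignment of two left-to-right HMM state sequences, each a
    list of (mean, variance) arrays, with the symmetrized per-state
    KLD as the local distortion score."""
    na, nb = len(states_a), len(states_b)
    D = np.full((na + 1, nb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, na + 1):
        for j in range(1, nb + 1):
            m1, v1 = states_a[i - 1]
            m2, v2 = states_b[j - 1]
            d = 0.5 * (kld_diag_gauss(m1, v1, m2, v2)
                       + kld_diag_gauss(m2, v2, m1, v1))
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[na, nb] / (na + nb)   # length-normalized distortion
```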

#23 Multi-flow block interleaving applied to distributed speech recognition over IP networks

Authors: Angel M. Gómez ; Juan J. Ramos-Muñoz ; Antonio M. Peinado ; Victoria Sánchez

Interleaving has been shown to be a useful technique for providing robust distributed speech recognition over IP networks, owing to its ability to disperse consecutive losses. However, this ability comes at the cost of the delay introduced by the interleaver. In this work, we propose a novel multi-flow block interleaver which exploits the presence of several streams and reduces the delay involved. Experimental results show that this interleaver approximates the performance of end-to-end interleavers at a fraction of their delay. As a disadvantage, the interleaver must be placed in a common node where more than one flow is available.
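
A toy sketch of the multi-flow idea: frames are written into a block whose rows are flows and read out column-wise, so a burst of consecutive packet losses hits non-adjacent frames of each individual flow. The equal-length assumption and framing are simplifications of the paper's scheme.

```python
def multiflow_interleave(flows):
    """Block-interleave frames from several DSR flows: fill the block
    row-wise (one row per flow), transmit column-wise. For flows
    f0, f1, f2 the output order is f0[0], f1[0], f2[0], f0[1], ...
    Assumes all flows have the same number of frames."""
    n = len(flows[0])
    assert all(len(f) == n for f in flows), "equal-length flows assumed"
    return [flows[r][c] for c in range(n) for r in range(len(flows))]
```

Because consecutive frames of any single flow are now separated by frames of the other flows, the same loss-dispersion is achieved while each flow buffers far fewer of its own frames, which is where the delay saving comes from.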

#24 Moving speech recognition from software to silicon: the in silico vox project

Authors: Edward C. Lin ; Kai Yu ; Rob A. Rutenbar ; Tsuhan Chen

To achieve much faster decoding, or much lower power consumption, we need to liberate speech recognition from the artificial constraints of its current software-only form and move the essential computations directly into silicon. There are vast efficiencies waiting to be unlocked in this application; we need the proper architecture to do so. We report results from a first-generation hardware architecture simulated at the bit level, and a complete, working FPGA-based prototype. Simulation results show that rather modest hardware designs, running 10-20X slower than conventional processors, can already decode at 0.6 xRT on the standard 5K Wall Street Journal benchmark.

#25 A study on detection based automatic speech recognition

Authors: Chengyuan Ma ; Yu Tsao ; Chin-Hui Lee

We propose a new approach to automatic speech recognition based on word detection and knowledge-based verification. Given an utterance, we first apply a collection of word detectors, one for each lexical item in the vocabulary. Pruning strategies are used to eliminate unlikely word candidates, and the detected words are then combined into word strings. The proposed approach differs from the conventional maximum a posteriori decoding method, and it is a critical component in building a bottom-up, detection-based speech recognition system in which knowledge of acoustics, speech, and language can easily be incorporated for pruning unlikely word hypotheses and for rescoring. The approach was evaluated on a connected digit task using phone models trained on the TIMIT corpus. Compared with state-of-the-art connected digit recognition algorithms, we found that the proposed detection-based framework works well even when no digit samples were used for training the detectors and recognizers. Other knowledge-based constraints, such as manner and place of articulation detectors, can be incorporated into this detection-based approach to improve the robustness and performance of the overall system.
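
A toy per-word detector of the general kind described, scoring a candidate by its average per-frame log-likelihood ratio against a filler/background model; the scores would come from Viterbi alignment of the word's phone models, and the threshold is hypothetical.

```python
def detect_word(word_loglik, filler_loglik, num_frames, threshold=0.5):
    """Accept a word candidate when its duration-normalized
    log-likelihood ratio against the background model exceeds a
    threshold. Returns the decision and the ratio, which can later
    be reused for rescoring combined word strings."""
    llr = (word_loglik - filler_loglik) / max(num_frames, 1)
    return llr > threshold, llr
```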